November 19, 2019

Plan for today

  • Text formats and encoding
  • “Text as data” - why use text?
  • NLP workflow
  • Regular expressions
  • “Ask me anything”

Text formats

Revisited: Basic units of data

  • Bits
    • Smallest unit of storage; a 0 or 1
    • With n bits, can store \(2^n\) patterns
  • Bytes
    • 8 bits = 1 byte (hence one byte can store 256 patterns)
    • “eight-bit encodings” are used to represent characters: in ASCII, for example, A is represented as 01000001
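The bit-level representation can be inspected directly in R (a minimal sketch using only base R functions):

```r
# the code point of "A" is 65, i.e. 01000001 in binary
utf8ToInt("A")
## [1] 65
# the eight bits shown explicitly, most significant bit first
rev(as.integer(intToBits(utf8ToInt("A")))[1:8])
## [1] 0 1 0 0 0 0 0 1
# with n bits we can store 2^n patterns: one byte holds 256
2^8
## [1] 256
```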

ASCII

Encoding

Solution: Unicode

  • Unicode was developed to provide a unique number (a “code point”) to every known character – even some that are “unknown”
  • Problem: there are far more code points than fit into 8-bit encodings, hence there are multiple ways to encode Unicode code points
  • Variable-byte encodings use multiple bytes as needed. The advantage is efficiency: most ASCII and simple extended character sets need just one byte, since these were set in the Unicode standard to their ASCII and ISO-8859 equivalents
  • The two most common are UTF-8 and UTF-16, which use 8-bit and 16-bit code units respectively
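The variable-byte nature of UTF-8 can be seen in R (a sketch; `enc2utf8()` is used so the example does not depend on the session locale):

```r
# one ASCII character is one byte in UTF-8 ...
charToRaw("A")
## [1] 41
# ... but a non-ASCII character may need several
x <- enc2utf8("é")
charToRaw(x)
## [1] c3 a9
nchar(x)                   # one character
## [1] 1
nchar(x, type = "bytes")   # stored in two bytes
## [1] 2
```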

Things to watch out for

  • Input texts can be very different
  • Much text-production software (e.g. MS Office-based products) still tends to use proprietary encodings, such as Windows-1252
  • Windows tends to use UTF-16, while macOS and other Unix-based platforms use UTF-8
  • Your eyes can deceive you: a viewer may display gibberish even though the underlying encoding is intact
  • There is no foolproof way to detect an encoding from the bytes alone (except when declared, e.g. in HTML metadata)
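When the source encoding is known (or guessed), base R's `iconv()` can convert it to UTF-8. A sketch, assuming byte 0x92, which is the right single quote in Windows-1252; the encoding name `"WINDOWS-1252"` may be spelled `"CP1252"` on some platforms:

```r
# 0x92 is the right single quote (’) in Windows-1252
x <- rawToChar(as.raw(0x92))
# declare the actual source encoding, then convert to UTF-8
iconv(x, from = "WINDOWS-1252", to = "UTF-8")
## [1] "’"
```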

Document formats

  • Many different formats contain text
  • How many can you think of?
  • What problems are encountered?

Why use text?

Measuring unobservables

  • psychological states
  • sentiment
  • “topics”
  • Ideology: “left-right” policy positions
  • corruption
  • cultural values
  • power

Example: budget debate

Textual statistics

  • Typically computed from lexical features
  • Examples include:
    • frequency analysis
    • readability
    • lexical diversity
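Two of these statistics can be sketched in a few lines of base R: word-form frequencies, and lexical diversity as a simple type-token ratio (unique word forms divided by total tokens). The text and the whitespace tokenizer are illustrative only:

```r
txt <- "The quick brown fox jumps over the lazy dog the end"
tokens <- strsplit(tolower(txt), "\\s+")[[1]]
# frequency analysis: counts of each word form
head(sort(table(tokens), decreasing = TRUE), 3)
# lexical diversity as a type-token ratio: 9 types / 11 tokens
length(unique(tokens)) / length(tokens)
## [1] 0.8181818
```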

Example: “Readability” on US Presidential Speeches

  • A corpus of State of the Union addresses, given each January by the US President
  • Uses a descriptive index of “reading ease”, developed in educational science, known as the Flesch-Kincaid measure
  • A widely identified trend: reading ease has greatly increased over time, i.e. SOTU addresses are getting simpler

Visualized

But not a general trend

  • Only appears in US presidential State-of-the-Union addresses
  • Partly an effect of changing audiences and the mode of delivery: from written to spoken form
  • Did not hold true for other forms of political communication, either in the US or comparatively

Differences in Reading Ease by Delivery Method of Speech

Not evident in other US texts

Not evident in other contexts

Prediction

“Ground truth” (for supervised learning)

  • Human annotation: Part of the research process
    • Authors
    • Students
    • Crowd workers
  • Accompanies the data: Part of the data generation process
    • Star ratings for reviews - most movie sentiment databases (Pang, Lee, and Vaithyanathan 2002)
    • Votes for legislative bills - Thomas, Pang and Lee (2006)

Examples

Examples

Not always obvious

  • text is not like cats v. dogs
  • sometimes even cats v. dogs is not clear-cut

Workflow

Text analysis workflow

Tokenization

Dictionary approaches

  • Map each word or phrase to a “dictionary” of words, associated with a known “sentiment” or psychological state
  • Treats matches within each dictionary as equivalent
  • Dictionary choice is important, but the computer does all of the counting – so perfectly reliable
  • Examples: Linguistic Inquiry and Word Count, or the General Inquirer
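The counting logic can be sketched in base R. This is a toy illustration, not the LIWC or General Inquirer implementation: a tiny made-up dictionary whose `*` entries are treated as prefix wildcards, as in LIWC. (Real entries such as `:)` would additionally need their regex metacharacters escaped.)

```r
# a toy "positive emotion" dictionary; entries ending in * match prefixes
posemo <- c("love", "joy*", "kind")
txt <- c("we love joyful kind words", "nothing here")
tokens <- strsplit(tolower(txt), "\\s+")
# turn each entry into a regex anchored to the whole token
pats <- paste0("^", gsub("\\*$", ".*", posemo), "$")
# count dictionary matches per document
count_matches <- function(toks)
  sum(sapply(pats, function(p) sum(grepl(p, toks))))
sapply(tokens, count_matches)
## [1] 3 0
```

Every match within the dictionary key counts the same – this is what “treats matches as equivalent” means in practice.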

Dictionary example (from LIWC 2015)

Dictionary object with 1 key entry.
- [posemo]:
- like, like*, :), (:, accept, accepta*, accepted, accepting, accepts, active, …
interests, invigor*, joke*, joking, jolly, joy*, keen*, kidding, 
kind, kindly, kindn*, kiss*, laidback, laugh*, legit, libert*, 
likeab*, liked, likes, liking, livel*, lmao*, lmfao*, lol, love, loved, lovelier, ...

Problems

  • “polysemy” – multiple meanings. kind has three!
  • From State of the Union corpus: 318 matches
    • kind/NOUN – 95%
    • kind (of)/ADVERB – 1%
    • kind/ADJECTIVE – 4%
  • These are known as false positives
  • Other problem: false negatives (what we miss)
    • Missed: kindliness
    • Also missed: altruistic and magnanimous

Regular expressions

Regular expressions

Matching

  • simplest: match exact “strings”
library("stringr")
x <- c("apple", "banana", "pear")
str_extract(x, "an")
## [1] NA   "an" NA

Matching - case sensitive

  • matching is case-sensitive by default, but this can be changed
bananas <- c("banana", "Banana", "BANANA")
str_detect(bananas, "banana")
## [1]  TRUE FALSE FALSE
str_detect(bananas, regex("banana", ignore_case = TRUE))
## [1] TRUE TRUE TRUE

Wildcards

  • . matches any single character
  • * matches zero or more of the preceding element
  • + matches one or more of the preceding element
  • () allows grouping
  • [] defines character classes
  • \p{} matches Unicode character categories
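Each of these can be tried out with stringr (the strings and patterns below are illustrative):

```r
library("stringr")
# . matches any single character
str_detect(c("pet", "pit", "apt"), "p.t")
## [1]  TRUE  TRUE FALSE
# * and + repeat the preceding element (zero or more / one or more)
str_detect(c("bt", "bat", "baat"), "ba*t")
## [1] TRUE TRUE TRUE
str_detect(c("bt", "bat", "baat"), "ba+t")
## [1] FALSE  TRUE  TRUE
# [] defines a character class
str_detect(c("gray", "grey", "groy"), "gr[ae]y")
## [1]  TRUE  TRUE FALSE
# \p{} matches a Unicode category, here "any letter"
str_detect(c("abc", "123"), "\\p{L}")
## [1]  TRUE FALSE
```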

For more details

  • R package stringr
  • R package stringi
  • base R functions such as grep() and gsub() also work, though with a less consistent interface

And don’t forget the humble “glob”

  • beginning and end of match are built into the pattern
  • * matches zero or more characters
  • ? matches exactly one character
  • built into quanteda: ?quanteda::valuetype
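Base R (via `utils::glob2rx()`) can translate a glob into the equivalent regular expression, which makes the two notations easy to compare:

```r
# * becomes .* (the trailing .*$ is trimmed by default)
glob2rx("ban*")
## [1] "^ban"
# ? becomes . (any single character), with anchors at both ends
glob2rx("b?nana")
## [1] "^b.nana$"
```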

Back to the dictionary example

Dictionary object with 1 key entry.
- [posemo]:
- like, like*, :), (:, accept, accepta*, accepted, accepting, accepts, active, …
interests, invigor*, joke*, joking, jolly, joy*, keen*, kidding, 
kind, kindly, kindn*, kiss*, laidback, laugh*, legit, libert*, 
likeab*, liked, likes, liking, livel*, lmao*, lmfao*, lol, love, loved, lovelier, ...

How do you do more with text?

Take MY429: Quantitative Text Analysis (Lent Term)

Week 8 Lab

Textual data hackathon!